moderation guardrail
Interview with Anindya Das Antar: Evaluating effectiveness of moderation guardrails in aligning LLM outputs
In their paper presented at AIES 2025, "Do Your Guardrails Even Guard? Method for Evaluating Effectiveness of Moderation Guardrails in Aligning LLM Outputs with Expert User Expectations", Anindya Das Antar, Xun Huan, and Nikola Banovic propose a method to evaluate and select guardrails that best align LLM outputs with domain knowledge from subject-matter experts. Here, Anindya tells us more about their method, some case studies, and plans for future developments.

Could you give us some background to your work - why are guardrails such an important area for study?

Ensuring that large language models (LLMs) produce desirable outputs without harmful side effects, and align with user expectations, organizational goals, and existing domain knowledge, is crucial for their adoption in high-stakes decision-making. However, despite training on vast amounts of data, LLMs can still produce incorrect, misleading, or otherwise unexpected and undesirable outputs.
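To make the selection problem concrete, here is a minimal Python sketch - not the authors' method, just one simplified framing under assumed interfaces - in which each candidate guardrail is a function that flags an LLM output, and candidates are ranked by how often their decisions agree with expert labels:

    from typing import Callable, List, Tuple

    # A "guardrail" here is any function mapping an LLM output to True (block) or False (allow).
    Guardrail = Callable[[str], bool]

    def agreement(guardrail: Guardrail, labeled: List[Tuple[str, bool]]) -> float:
        """Fraction of expert-labeled outputs on which the guardrail matches the expert decision."""
        hits = sum(guardrail(text) == should_block for text, should_block in labeled)
        return hits / len(labeled)

    def select_guardrail(candidates: dict, labeled: List[Tuple[str, bool]]) -> str:
        """Name of the candidate whose decisions best agree with the expert labels."""
        return max(candidates, key=lambda name: agreement(candidates[name], labeled))

    # Toy example: two keyword-based guardrails and two expert-labeled outputs.
    candidates = {
        "strict": lambda text: "dosage" in text.lower(),
        "lenient": lambda text: "overdose" in text.lower(),
    }
    labeled = [
        ("The recommended dosage is 10 mg.", False),
        ("Anything above 50 mg risks an overdose.", True),
    ]
    print(select_guardrail(candidates, labeled))  # -> "lenient"

The keyword guardrails and the labeled examples are purely hypothetical; the point is only the evaluate-then-select structure.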
Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters
Jin, Haibo, Zhou, Andy, Menke, Joe D., Wang, Haohan
Large Language Models (LLMs) are typically harmless but remain vulnerable to carefully crafted prompts known as "jailbreaks", which can bypass protective measures and induce harmful behavior. Recent advancements in LLMs have incorporated moderation guardrails that can filter outputs, which trigger processing errors for certain malicious questions. Existing red-teaming benchmarks often neglect to include questions that trigger moderation guardrails, making it difficult to evaluate jailbreak effectiveness. To address this issue, we introduce JAMBench, a harmful behavior benchmark designed to trigger and evaluate moderation guardrails. JAMBench involves 160 manually crafted instructions covering four major risk categories at multiple severity levels. Furthermore, we propose a jailbreak method, JAM (Jailbreak Against Moderation), designed to attack moderation guardrails using jailbreak prefixes to bypass input-level filters and a fine-tuned shadow model, functionally equivalent to the guardrail model, to generate cipher characters that bypass output-level filters. Our extensive experiments on four LLMs demonstrate that JAM achieves higher jailbreak success (~19.88x) and lower filtered-out rates (~1/6x) than baselines.
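As a rough illustration of how such a benchmark is scored - assumed interfaces only, not the JAMBench evaluation code - the two headline numbers reduce to simple rates over the instruction set:

    # Sketch under assumed interfaces. `query_llm` calls the guarded target model and
    # `is_harmful` is a stand-in harmfulness judge; both are hypothetical callables.
    FILTERED = "FILTERED"  # sentinel representing a response blocked by the guardrail

    def evaluate(instructions, query_llm, is_harmful):
        """Return (jailbreak_success_rate, filtered_out_rate) over a benchmark."""
        successes = filtered = 0
        for prompt in instructions:
            response = query_llm(prompt)
            if response == FILTERED:
                filtered += 1      # the output-level moderation filter fired
            elif is_harmful(response):
                successes += 1     # harmful content slipped past the guardrails
        n = len(instructions)
        return successes / n, filtered / n

A jailbreak method is then stronger when it drives the first rate up and the second down, which is exactly the pair of improvements the abstract reports.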
Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation
Zhuang, Jun, Jin, Haibo, Zhang, Ye, Kang, Zhengjian, Zhang, Wenbin, Dagher, Gaby G., Wang, Haohan
Intent detection, a core component of natural language understanding, has considerably evolved as a crucial mechanism in safeguarding large language models (LLMs). While prior work has applied intent detection to enhance LLMs' moderation guardrails, showing a significant success against content-level jailbreaks, the robustness of these intent-aware guardrails under malicious manipulations remains under-explored. In this work, we investigate the vulnerability of intent-aware guardrails and demonstrate that LLMs exhibit implicit intent detection capabilities. We propose a two-stage intent-based prompt-refinement framework, IntentPrompt, that first transforms harmful inquiries into structured outlines and further reframes them into declarative-style narratives by iteratively optimizing prompts via feedback loops to enhance jailbreak success for red-teaming purposes. Extensive experiments across four public benchmarks and various black-box LLMs indicate that our framework consistently outperforms several cutting-edge jailbreak methods and evades even advanced Intent Analysis (IA) and Chain-of-Thought (CoT)-based defenses. Specifically, our "FSTR+SPIN" variant achieves attack success rates ranging from 88.25% to 96.54% against CoT-based defenses on the o1 model, and from 86.75% to 97.12% on the GPT-4o model under IA-based defenses. These findings highlight a critical weakness in LLMs' safety mechanisms and suggest that intent manipulation poses a growing challenge to content moderation guardrails.
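Structurally, the two-stage refinement loop the abstract describes can be sketched as follows; every helper is injected as a hypothetical callable, and this is not the IntentPrompt implementation:

    # Structural sketch of an intent-manipulation red-teaming loop. Hypothetical helpers:
    # `to_outline` (stage 1: rewrite the query as a structured outline), `to_narrative`
    # (stage 2: reframe as a declarative-style narrative), `revise` (update the prompt
    # from model feedback), `target_llm` (the black-box model), `refused` (refusal check).
    def refine(query, to_outline, to_narrative, revise, target_llm, refused, max_iters=5):
        prompt = to_narrative(to_outline(query))   # stage 1, then stage 2
        for _ in range(max_iters):
            response = target_llm(prompt)
            if not refused(response):
                return prompt                      # candidate that evaded the guardrail
            prompt = revise(prompt, response)      # feedback loop: optimize the prompt
        return None                                # iteration budget exhausted

The design point the paper stresses is the feedback loop: each refusal is treated as a signal for the next prompt revision, rather than as a terminal failure.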